DOMAIN: Telecom
CONTEXT: A telecom company wants to use its historical customer data to predict customer behaviour and retain customers. The goal is to analyse all relevant customer data and develop focused customer retention programs.
• DATA DESCRIPTION: Each row represents a customer; each column contains a customer attribute, as described in the column metadata.
The data set includes information about:
• Customers who left within the last month – the column is called Churn
• Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
• Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
• Demographic info about customers – gender, age range, and if they have partners and dependents
• PROJECT OBJECTIVE: Build a model that identifies customers with a high probability of churning. This helps the company understand the pain points and patterns behind customer churn and sharpens the focus on customer retention strategies.
Import All the Libraries
Revision History: 12-12-2021 - Label Encoding enhanced
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score, confusion_matrix
import warnings
warnings.filterwarnings("ignore")
#from sklearn.feature_extraction.text import CountVectorizer # Not needed: a decision tree does not take raw strings as input for the fit step.
Data Understanding & Exploration: [5 Marks]
A. Read ‘TelcomCustomer-Churn_1.csv’ as a DataFrame and assign it to a variable. [1 Mark]
churn1 = pd.read_csv('TelcomCustomer-Churn_1.csv') # import the csv file
churn1.head(10)
| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No |
| 5 | 9305-CDSKC | Female | 0 | No | No | 8 | Yes | Yes | Fiber optic | No |
| 6 | 1452-KIOVK | Male | 0 | No | Yes | 22 | Yes | Yes | Fiber optic | No |
| 7 | 6713-OKOMC | Female | 0 | No | No | 10 | No | No phone service | DSL | Yes |
| 8 | 7892-POOKP | Female | 0 | Yes | No | 28 | Yes | Yes | Fiber optic | No |
| 9 | 6388-TABGU | Male | 0 | No | Yes | 62 | Yes | No | DSL | Yes |
churn1.shape
(7043, 10)
Observation 1: There are 7,043 rows (observations) and 10 columns (attributes) in the churn1 dataset.
B. Read ‘TelcomCustomer-Churn_2.csv’ as a DataFrame and assign it to a variable. [1 Mark]
churn2 = pd.read_csv('TelcomCustomer-Churn_2.csv') # import the csv file
churn2.head(10)
| customerID | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | 9237-HQITU | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
| 5 | 9305-CDSKC | No | Yes | No | Yes | Yes | Month-to-month | Yes | Electronic check | 99.65 | 820.5 | Yes |
| 6 | 1452-KIOVK | Yes | No | No | Yes | No | Month-to-month | Yes | Credit card (automatic) | 89.10 | 1949.4 | No |
| 7 | 6713-OKOMC | No | No | No | No | No | Month-to-month | No | Mailed check | 29.75 | 301.9 | No |
| 8 | 7892-POOKP | No | Yes | Yes | Yes | Yes | Month-to-month | Yes | Electronic check | 104.80 | 3046.05 | Yes |
| 9 | 6388-TABGU | Yes | No | No | No | No | One year | No | Bank transfer (automatic) | 56.15 | 3487.95 | No |
churn2.shape
(7043, 12)
Observation 2: There are 7,043 rows (observations) and 12 columns (attributes) in the churn2 dataset.
C. Merge both the DataFrames on key ‘customerID’ to form a single DataFrame [2 Mark]
# Merging the churn1 and churn2 datasets on customerID
cdata = pd.merge(churn1, churn2, on = 'customerID')
cdata.head(10)
| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
| 5 | 9305-CDSKC | Female | 0 | No | No | 8 | Yes | Yes | Fiber optic | No | ... | Yes | No | Yes | Yes | Month-to-month | Yes | Electronic check | 99.65 | 820.5 | Yes |
| 6 | 1452-KIOVK | Male | 0 | No | Yes | 22 | Yes | Yes | Fiber optic | No | ... | No | No | Yes | No | Month-to-month | Yes | Credit card (automatic) | 89.10 | 1949.4 | No |
| 7 | 6713-OKOMC | Female | 0 | No | No | 10 | No | No phone service | DSL | Yes | ... | No | No | No | No | Month-to-month | No | Mailed check | 29.75 | 301.9 | No |
| 8 | 7892-POOKP | Female | 0 | Yes | No | 28 | Yes | Yes | Fiber optic | No | ... | Yes | Yes | Yes | Yes | Month-to-month | Yes | Electronic check | 104.80 | 3046.05 | Yes |
| 9 | 6388-TABGU | Male | 0 | No | Yes | 62 | Yes | No | DSL | Yes | ... | No | No | No | No | One year | No | Bank transfer (automatic) | 56.15 | 3487.95 | No |
10 rows × 21 columns
cdata.shape
(7043, 21)
Observation 3: There are 21 variables in total: 20 independent variables and one target variable, Churn.
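The merge above assumes customerID uniquely identifies each row in both files; pandas can enforce that assumption directly. A minimal sketch using two small synthetic frames (stand-ins for churn1 and churn2, since the real CSVs are not reproduced here):

```python
import pandas as pd

# Small synthetic stand-ins for churn1 and churn2, sharing the customerID key
left = pd.DataFrame({"customerID": ["A1", "B2"], "gender": ["Female", "Male"]})
right = pd.DataFrame({"customerID": ["A1", "B2"], "Churn": ["No", "Yes"]})

# validate='one_to_one' raises MergeError if the key is duplicated on either side;
# indicator=True adds a '_merge' column recording where each row matched
merged = pd.merge(left, right, on="customerID", validate="one_to_one", indicator=True)
print(merged["_merge"].value_counts())
```

With validate='one_to_one', a duplicated key in either file raises an error instead of silently multiplying rows, and the _merge indicator exposes rows that matched on only one side.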
cdata.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| SeniorCitizen | 7043.0 | 0.162147 | 0.368612 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 |
| tenure | 7043.0 | 32.371149 | 24.559481 | 0.00 | 9.0 | 29.00 | 55.00 | 72.00 |
| MonthlyCharges | 7043.0 | 64.761692 | 30.090047 | 18.25 | 35.5 | 70.35 | 89.85 | 118.75 |
Observation 4: Only 3 columns (SeniorCitizen, tenure, MonthlyCharges) are numerical, which is why describe() shows only these.
cdata.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 7043 entries, 0 to 7042 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 customerID 7043 non-null object 1 gender 7043 non-null object 2 SeniorCitizen 7043 non-null int64 3 Partner 7043 non-null object 4 Dependents 7043 non-null object 5 tenure 7043 non-null int64 6 PhoneService 7043 non-null object 7 MultipleLines 7043 non-null object 8 InternetService 7043 non-null object 9 OnlineSecurity 7043 non-null object 10 OnlineBackup 7043 non-null object 11 DeviceProtection 7043 non-null object 12 TechSupport 7043 non-null object 13 StreamingTV 7043 non-null object 14 StreamingMovies 7043 non-null object 15 Contract 7043 non-null object 16 PaperlessBilling 7043 non-null object 17 PaymentMethod 7043 non-null object 18 MonthlyCharges 7043 non-null float64 19 TotalCharges 7043 non-null object 20 Churn 7043 non-null object dtypes: float64(1), int64(2), object(18) memory usage: 1.2+ MB
Observation 5: Variables such as gender, Partner and Dependents are of object dtype; for model building they are converted to the categorical dtype below.
for feature in cdata.columns:                # Loop through all columns in the dataframe
    if cdata[feature].dtype == 'object':     # Only convert columns holding categorical strings
        cdata[feature] = pd.Categorical(cdata[feature])  # Convert to the pandas categorical dtype
cdata.head(10)
| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
| 5 | 9305-CDSKC | Female | 0 | No | No | 8 | Yes | Yes | Fiber optic | No | ... | Yes | No | Yes | Yes | Month-to-month | Yes | Electronic check | 99.65 | 820.5 | Yes |
| 6 | 1452-KIOVK | Male | 0 | No | Yes | 22 | Yes | Yes | Fiber optic | No | ... | No | No | Yes | No | Month-to-month | Yes | Credit card (automatic) | 89.10 | 1949.4 | No |
| 7 | 6713-OKOMC | Female | 0 | No | No | 10 | No | No phone service | DSL | Yes | ... | No | No | No | No | Month-to-month | No | Mailed check | 29.75 | 301.9 | No |
| 8 | 7892-POOKP | Female | 0 | Yes | No | 28 | Yes | Yes | Fiber optic | No | ... | Yes | Yes | Yes | Yes | Month-to-month | Yes | Electronic check | 104.80 | 3046.05 | Yes |
| 9 | 6388-TABGU | Male | 0 | No | Yes | 62 | Yes | No | DSL | Yes | ... | No | No | No | No | One year | No | Bank transfer (automatic) | 56.15 | 3487.95 | No |
10 rows × 21 columns
cdata.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 7043 entries, 0 to 7042 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 customerID 7043 non-null category 1 gender 7043 non-null category 2 SeniorCitizen 7043 non-null int64 3 Partner 7043 non-null category 4 Dependents 7043 non-null category 5 tenure 7043 non-null int64 6 PhoneService 7043 non-null category 7 MultipleLines 7043 non-null category 8 InternetService 7043 non-null category 9 OnlineSecurity 7043 non-null category 10 OnlineBackup 7043 non-null category 11 DeviceProtection 7043 non-null category 12 TechSupport 7043 non-null category 13 StreamingTV 7043 non-null category 14 StreamingMovies 7043 non-null category 15 Contract 7043 non-null category 16 PaperlessBilling 7043 non-null category 17 PaymentMethod 7043 non-null category 18 MonthlyCharges 7043 non-null float64 19 TotalCharges 7043 non-null category 20 Churn 7043 non-null category dtypes: category(18), float64(1), int64(2) memory usage: 981.9 KB
Comments: All object-dtype columns have been converted to the categorical dtype.
D. Verify if all the columns are incorporated in the merged DataFrame by using simple comparison Operator in Python. [1 Marks]
churn1.columns
Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
'OnlineSecurity'],
dtype='object')
churn2.columns
Index(['customerID', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
dtype='object')
cdata.columns
Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
dtype='object')
churn1.columns.isin(cdata.columns)
array([ True, True, True, True, True, True, True, True, True,
True])
churn2.columns.isin(cdata.columns)
array([ True, True, True, True, True, True, True, True, True,
True, True, True])
# Report any churn1 column that is missing from the merged dataframe
for col in churn1.columns[~churn1.columns.isin(cdata.columns)]:
    print("Column " + col + " not available")
if False in churn1.columns.isin(cdata.columns):
    print("Missing columns")
else:
    print("All columns of churn1 are available in merged dataframe")
All columns of churn1 are available in merged dataframe
isin() checks whether every column of churn1 is present in cdata; a missing column name shows up as False. The code below checks whether all columns of both churn1 and churn2 are available in the merged dataframe.
# 'or' (not 'and'): a column missing from either source should be flagged
if False in churn1.columns.isin(cdata.columns) or False in churn2.columns.isin(cdata.columns):
    print("Missing columns")
else:
    print("All columns available")
All columns available
combined=list(set(churn1.columns.append(churn2.columns)))
print(combined)
['StreamingTV', 'customerID', 'Dependents', 'gender', 'PaperlessBilling', 'MonthlyCharges', 'PaymentMethod', 'StreamingMovies', 'Partner', 'InternetService', 'Contract', 'TotalCharges', 'TechSupport', 'MultipleLines', 'OnlineBackup', 'SeniorCitizen', 'OnlineSecurity', 'tenure', 'Churn', 'DeviceProtection', 'PhoneService']
print(len(combined))
print(len(cdata.columns))
21 21
print(len(set(combined).difference(cdata.columns)))
0
if len(set(churn1.columns.append(churn2.columns)).difference(cdata.columns)) == 0:
    print("All columns of churn1 and churn2 are available in merged dataset")
else:
    print("Few columns are missing in the new dataset")
All columns of churn1 and churn2 are available in merged dataset
churn1.columns.union(churn2.columns)
Index(['Churn', 'Contract', 'Dependents', 'DeviceProtection',
'InternetService', 'MonthlyCharges', 'MultipleLines', 'OnlineBackup',
'OnlineSecurity', 'PaperlessBilling', 'Partner', 'PaymentMethod',
'PhoneService', 'SeniorCitizen', 'StreamingMovies', 'StreamingTV',
'TechSupport', 'TotalCharges', 'customerID', 'gender', 'tenure'],
dtype='object')
churn1.columns.union(churn2.columns).isin(cdata.columns)
array([ True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True])
len(churn1.columns.union(churn2.columns))
21
# This method accepts two source dataframes and one destination dataframe and checks for missed columns
# It returns True if any column is missing and False otherwise
def isMissingColumnsInMergedData(source1, source2, destination):
    # The union of the source1 and source2 columns gives every expected column
    # comparisonResult is a boolean array marking each column's presence in the destination
    comparisonResult = source1.columns.union(source2.columns).isin(destination.columns)
    # If any column is missing from the merged dataset, comparisonResult contains False
    result = False in comparisonResult
    return result
print("Merged data has missed columns ",isMissingColumnsInMergedData(churn1,churn2,cdata))
Merged data has missed columns False
Observation 6: It has been verified that all the columns of churn1 and churn2 are available in the new dataset cdata.
2. Data Cleaning & Analysis: [15 Marks]
cdata.mean() # print the mean of each numeric attribute; ignore SeniorCitizen, as it is not a continuous variable
SeniorCitizen 0.162147 tenure 32.371149 MonthlyCharges 64.761692 dtype: float64
cdata["SeniorCitizen"].mode() # Most of the customer are not senior citizen
0 0 dtype: int64
cdata.median()
SeniorCitizen 0.00 tenure 29.00 MonthlyCharges 70.35 dtype: float64
dupes = cdata.duplicated()
sum(dupes)
0
Observation 7: There are no duplicate rows in the dataset.
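Note that duplicated() on the full frame only catches rows identical in every column; a customer appearing twice with, say, different tenure values would slip through. A small sketch on synthetic data (not the real churn data) shows the difference when checking the key column directly:

```python
import pandas as pd

# Synthetic frame: customerID "A1" appears twice with different tenure values
df = pd.DataFrame({"customerID": ["A1", "A1", "B2"], "tenure": [1, 2, 5]})

print(df.duplicated().sum())                # 0: no fully identical rows
print(df["customerID"].duplicated().sum())  # 1: but the key column repeats
```

In this dataset customerID has 7,043 distinct values (see its value counts below), so both checks agree here.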
A. Impute missing/unexpected values in the DataFrame. [2 Marks]
cdata.isnull().sum()
customerID 0 gender 0 SeniorCitizen 0 Partner 0 Dependents 0 tenure 0 PhoneService 0 MultipleLines 0 InternetService 0 OnlineSecurity 0 OnlineBackup 0 DeviceProtection 0 TechSupport 0 StreamingTV 0 StreamingMovies 0 Contract 0 PaperlessBilling 0 PaymentMethod 0 MonthlyCharges 0 TotalCharges 0 Churn 0 dtype: int64
cdata.dtypes
customerID category gender category SeniorCitizen int64 Partner category Dependents category tenure int64 PhoneService category MultipleLines category InternetService category OnlineSecurity category OnlineBackup category DeviceProtection category TechSupport category StreamingTV category StreamingMovies category Contract category PaperlessBilling category PaymentMethod category MonthlyCharges float64 TotalCharges category Churn category dtype: object
cdata.head()
| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
5 rows × 21 columns
cdata.tail()
| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7038 | 6840-RESVB | Male | 0 | Yes | Yes | 24 | Yes | Yes | DSL | Yes | ... | Yes | Yes | Yes | Yes | One year | Yes | Mailed check | 84.80 | 1990.5 | No |
| 7039 | 2234-XADUH | Female | 0 | Yes | Yes | 72 | Yes | Yes | Fiber optic | No | ... | Yes | No | Yes | Yes | One year | Yes | Credit card (automatic) | 103.20 | 7362.9 | No |
| 7040 | 4801-JZAZL | Female | 0 | Yes | Yes | 11 | No | No phone service | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.60 | 346.45 | No |
| 7041 | 8361-LTMKD | Male | 1 | Yes | No | 4 | Yes | Yes | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 74.40 | 306.6 | Yes |
| 7042 | 3186-AJIEK | Male | 0 | No | No | 66 | Yes | No | Fiber optic | Yes | ... | Yes | Yes | Yes | Yes | Two year | Yes | Bank transfer (automatic) | 105.65 | 6844.5 | No |
5 rows × 21 columns
# This function returns all the categorical variables of a dataset
def findCategoricalData(data):
    categorical_cols = data.select_dtypes(exclude=[np.number]).columns
    return categorical_cols

# This function returns all the numerical variables of a dataset
def findNumericalData(data):
    numerical_cols = data.select_dtypes(include=[np.number]).columns
    return numerical_cols
categoricalColumns =findCategoricalData(cdata)
print(categoricalColumns)
Index(['customerID', 'gender', 'Partner', 'Dependents', 'PhoneService',
'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup',
'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies',
'Contract', 'PaperlessBilling', 'PaymentMethod', 'TotalCharges',
'Churn'],
dtype='object')
numericalColumns=findNumericalData(cdata)
print(numericalColumns)
Index(['SeniorCitizen', 'tenure', 'MonthlyCharges'], dtype='object')
# Print the value counts of the categorical columns to spot unexpected values
for col in cdata[categoricalColumns]:
    print(cdata[col].value_counts())
0002-ORFBO 1
6616-AALSR 1
6625-UTXEW 1
6625-IUTTT 1
6625-FLENO 1
..
3352-RICWQ 1
3352-ALMCK 1
3351-NQLDI 1
3351-NGXYI 1
9995-HOTOH 1
Name: customerID, Length: 7043, dtype: int64
Male 3555
Female 3488
Name: gender, dtype: int64
No 3641
Yes 3402
Name: Partner, dtype: int64
No 4933
Yes 2110
Name: Dependents, dtype: int64
Yes 6361
No 682
Name: PhoneService, dtype: int64
No 3390
Yes 2971
No phone service 682
Name: MultipleLines, dtype: int64
Fiber optic 3096
DSL 2421
No 1526
Name: InternetService, dtype: int64
No 3498
Yes 2019
No internet service 1526
Name: OnlineSecurity, dtype: int64
No 3088
Yes 2429
No internet service 1526
Name: OnlineBackup, dtype: int64
No 3095
Yes 2422
No internet service 1526
Name: DeviceProtection, dtype: int64
No 3473
Yes 2044
No internet service 1526
Name: TechSupport, dtype: int64
No 2810
Yes 2707
No internet service 1526
Name: StreamingTV, dtype: int64
No 2785
Yes 2732
No internet service 1526
Name: StreamingMovies, dtype: int64
Month-to-month 3875
Two year 1695
One year 1473
Name: Contract, dtype: int64
Yes 4171
No 2872
Name: PaperlessBilling, dtype: int64
Electronic check 2365
Mailed check 1612
Bank transfer (automatic) 1544
Credit card (automatic) 1522
Name: PaymentMethod, dtype: int64
11
20.2 11
19.75 9
20.05 8
19.9 8
..
260.8 1
260.7 1
2599.95 1
2598.95 1
999.9 1
Name: TotalCharges, Length: 6531, dtype: int64
No 5174
Yes 1869
Name: Churn, dtype: int64
Observation: TotalCharges contains 11 blank entries (the unlabeled count of 11 at the top of its value counts), which is why the column was read as object/category rather than numeric. The other categorical columns contain only expected values.
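Those blanks can be counted before the numeric conversion in the next step. A minimal sketch on a synthetic column, assuming the blanks are empty or whitespace-only strings:

```python
import pandas as pd

# Synthetic stand-in for TotalCharges: numbers stored as strings, with blanks
total = pd.Series(["29.85", " ", "108.15", ""], dtype="object")

# Count entries that are empty once surrounding whitespace is stripped
blanks = total.str.strip().eq("").sum()
print(blanks)
```

The same pattern applied to the real column should report 11, matching the null count after pd.to_numeric(..., errors='coerce') below.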
B. Make sure all the variables with continuous values are of ‘Float’ type. [2 Marks]
customerdata =cdata.copy()
customerdata['TotalCharges'] = pd.to_numeric(customerdata['TotalCharges'], errors='coerce')
customerdata['TotalCharges'].dtypes
dtype('float64')
Comments: TotalCharges has been converted to the float datatype; blank entries become NaN because of errors='coerce'.
customerdata.dtypes
customerID category gender category SeniorCitizen int64 Partner category Dependents category tenure int64 PhoneService category MultipleLines category InternetService category OnlineSecurity category OnlineBackup category DeviceProtection category TechSupport category StreamingTV category StreamingMovies category Contract category PaperlessBilling category PaymentMethod category MonthlyCharges float64 TotalCharges float64 Churn category dtype: object
customerdata['TotalCharges'].isnull().sum()
11
Comments: There are 11 null values in TotalCharges after the conversion (the former blank strings).
customerdata[['tenure','MonthlyCharges','TotalCharges']].head()
| tenure | MonthlyCharges | TotalCharges | |
|---|---|---|---|
| 0 | 1 | 29.85 | 29.85 |
| 1 | 34 | 56.95 | 1889.50 |
| 2 | 2 | 53.85 | 108.15 |
| 3 | 45 | 42.30 | 1840.75 |
| 4 | 2 | 70.70 | 151.65 |
Comments: It is observed that TotalCharges is approximately the product of tenure and MonthlyCharges.
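This relationship can be checked numerically. A sketch on synthetic rows patterned after the head() output above compares TotalCharges against tenure × MonthlyCharges with a small relative tolerance (in the real data the product is only approximate, e.g. because prices can change mid-tenure):

```python
import pandas as pd

# Synthetic rows patterned on the table above
sample = pd.DataFrame({
    "tenure": [1, 34, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "TotalCharges": [29.85, 1889.50, 108.15],
})

# Largest relative gap between TotalCharges and tenure * MonthlyCharges
approx = sample["tenure"] * sample["MonthlyCharges"]
rel_gap = ((sample["TotalCharges"] - approx).abs() / sample["TotalCharges"]).max()
print(rel_gap < 0.05)  # True: the product closely approximates TotalCharges here
```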
customerdata.isnull().sum()
customerID 0 gender 0 SeniorCitizen 0 Partner 0 Dependents 0 tenure 0 PhoneService 0 MultipleLines 0 InternetService 0 OnlineSecurity 0 OnlineBackup 0 DeviceProtection 0 TechSupport 0 StreamingTV 0 StreamingMovies 0 Contract 0 PaperlessBilling 0 PaymentMethod 0 MonthlyCharges 0 TotalCharges 11 Churn 0 dtype: int64
# Find the columns that contain any null value
null_columns = customerdata.columns[customerdata.isnull().any()]
noOfNullColumns = customerdata[null_columns].isnull().sum()
print(noOfNullColumns)
# Find the rows in which the null values occur
nullindices = customerdata[customerdata["TotalCharges"].isnull()][null_columns]
print(nullindices.index)
TotalCharges 11 dtype: int64 Int64Index([488, 753, 936, 1082, 1340, 3331, 3826, 4380, 5218, 6670, 6754], dtype='int64')
# Print the rows in which TotalCharges is null/NaN
customerdata.loc[nullindices.index, ['tenure','MonthlyCharges','TotalCharges']]
| tenure | MonthlyCharges | TotalCharges | |
|---|---|---|---|
| 488 | 0 | 52.55 | NaN |
| 753 | 0 | 20.25 | NaN |
| 936 | 0 | 80.85 | NaN |
| 1082 | 0 | 25.75 | NaN |
| 1340 | 0 | 56.05 | NaN |
| 3331 | 0 | 19.85 | NaN |
| 3826 | 0 | 25.35 | NaN |
| 4380 | 0 | 20.00 | NaN |
| 5218 | 0 | 19.70 | NaN |
| 6670 | 0 | 73.35 | NaN |
| 6754 | 0 | 61.90 | NaN |
cdata_copy = customerdata.copy()
#cdata_copy['TotalCharges'] = cdata_copy['TotalCharges'].fillna(cdata_copy['tenure']*cdata_copy['MonthlyCharges'])
# Impute TotalCharges with MonthlyCharges, since tenure is 0 for all of these rows
cdata_copy['TotalCharges'] = cdata_copy['TotalCharges'].fillna(cdata_copy['MonthlyCharges'])
Comments: As observed earlier, TotalCharges is roughly tenure times MonthlyCharges. Since tenure is zero for all of the null rows, the assumption is that these customers have not yet completed their first month; TotalCharges is therefore imputed with MonthlyCharges.
# Verify that all the values were imputed as expected
cdata_copy.loc[nullindices.index, ['tenure','MonthlyCharges','TotalCharges']]
| tenure | MonthlyCharges | TotalCharges | |
|---|---|---|---|
| 488 | 0 | 52.55 | 52.55 |
| 753 | 0 | 20.25 | 20.25 |
| 936 | 0 | 80.85 | 80.85 |
| 1082 | 0 | 25.75 | 25.75 |
| 1340 | 0 | 56.05 | 56.05 |
| 3331 | 0 | 19.85 | 19.85 |
| 3826 | 0 | 25.35 | 25.35 |
| 4380 | 0 | 20.00 | 20.00 |
| 5218 | 0 | 19.70 | 19.70 |
| 6670 | 0 | 73.35 | 73.35 |
| 6754 | 0 | 61.90 | 61.90 |
Comments: Tenure is 0 for all these rows, so TotalCharges was imputed with MonthlyCharges.
cdata_copy.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| SeniorCitizen | 7043.0 | 0.162147 | 0.368612 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| tenure | 7043.0 | 32.371149 | 24.559481 | 0.00 | 9.00 | 29.00 | 55.00 | 72.00 |
| MonthlyCharges | 7043.0 | 64.761692 | 30.090047 | 18.25 | 35.50 | 70.35 | 89.85 | 118.75 |
| TotalCharges | 7043.0 | 2279.798992 | 2266.730170 | 18.80 | 398.55 | 1394.55 | 3786.60 | 8684.80 |
Observation:
SeniorCitizen, tenure, MonthlyCharges and TotalCharges are the numerical columns; SeniorCitizen is effectively categorical (senior citizen or not).
tenure - ranges from 0 to 72 months. The mean (32.37) is greater than the median (29.00), so the distribution is positively skewed.
MonthlyCharges - ranges from 18.25 to 118.75. The mean (64.76) is less than the median (70.35), so it is negatively skewed.
TotalCharges - ranges from 18.80 to 8684.80. The mean (2279.80) is greater than the median (1394.55), so the distribution is positively skewed.
Reference
Positive skewness: if the mean is greater than the median, the distribution is positively skewed.
Negative skewness: If the mean is less than the median, the distribution is negatively skewed.
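The mean-versus-median rule of thumb can be cross-checked with pandas' built-in sample skewness. A minimal sketch on a small right-skewed sample:

```python
import pandas as pd

# A small right-skewed sample: the long tail pulls the mean above the median
s = pd.Series([1, 2, 2, 3, 3, 4, 20])

print(s.mean() > s.median())  # rule of thumb: mean > median suggests positive skew
print(s.skew() > 0)           # sample skewness agrees
```

Applying .skew() to tenure, MonthlyCharges and TotalCharges should confirm the signs inferred above.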
numericalColumns=findNumericalData(cdata_copy)
print(numericalColumns)
Index(['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges'], dtype='object')
# Print the datatypes of the numerical columns
for col in cdata_copy[numericalColumns]:
    print(col, cdata_copy[col].dtypes)
SeniorCitizen int64 tenure int64 MonthlyCharges float64 TotalCharges float64
Comments: The continuous variables MonthlyCharges and TotalCharges are of float datatype.
C.Create a function that will accept a DataFrame as input and return pie-charts for all the appropriate Categorical features. Clearly show percentage distribution in the pie-chart. [4 Marks]
def buildPieChart(data):
    colors = sns.color_palette('pastel')[0:5]
    categoricalColumns = findCategoricalData(data)
    for col in data[categoricalColumns]:
        if col == 'customerID':
            print(' customerID is Excluded for piechart')
        else:
            plt.title("Pie chart for " + col)
            # Use the passed-in dataframe, not the global cdata
            data[col].value_counts().plot.pie(autopct='%1.2f%%', shadow=True, colors=colors)
            plt.show()
buildPieChart(cdata_copy)
customerID is Excluded for piechart
D. Share insights for Q2.c. [2 Marks]
Following are the categorical features in the customer dataset:
1. gender: Male and female customers are almost equal (Male - 50.48%, Female - 49.52%).
2. Partner: Customers with and without a partner are also almost equal (No - 51.70%, Yes - 48.30%).
3. Dependents: Only about a third of customers have dependents (No - 70%, Yes - 30%).
4. PhoneService: About 90% of customers have phone service.
5. MultipleLines: Of the ~90% with phone service, 42% have multiple lines and 48% do not.
6. InternetService: Nearly 22% of customers have no internet service. Of the rest, 44% have fibre optic and about 34% have DSL.
For the internet-based services below, the percentages show how many customers subscribed to each:
6.a OnlineSecurity: nearly 50% do not have online security; 28% do.
6.b OnlineBackup: 44% do not have online backup; 34% do.
6.c DeviceProtection: similarly, 44% do not have device protection; 34% do.
6.d TechSupport: 49% do not have tech support; 29% do.
6.e StreamingTV: 38% have streaming TV; 40% do not.
6.f StreamingMovies: 38.7% have streaming movies; 39.3% do not.
7. Contract: Most customers (55%) opted for a month-to-month contract, followed by two-year (24%) and one-year (21%) contracts.
8. PaperlessBilling: 60% of customers prefer paperless billing; nearly 40% do not.
9. PaymentMethod: Electronic check (33.5%) and mailed check (22.8%) lead, followed by bank transfer (21%) and credit card (21%).
10. Churn: The churn rate is 26.5%.
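The percentages quoted above all come from normalized value counts; for example, the churn rate can be computed directly (shown here on a small synthetic Churn column):

```python
import pandas as pd

# Synthetic Churn column with one churner in four customers
churn = pd.Series(["No", "No", "No", "Yes"])

# normalize=True turns raw counts into proportions
rate = churn.value_counts(normalize=True)["Yes"]
print(f"Churn rate: {rate:.1%}")
```

On the real data, cdata_copy['Churn'].value_counts(normalize=True) yields the 26.5% quoted above.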
E. Encode all the appropriate Categorical features with the best suitable approach. [2 Marks]
categoricalColumns = findCategoricalData(cdata_copy)
print(categoricalColumns)
Index(['customerID', 'gender', 'Partner', 'Dependents', 'PhoneService',
'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup',
'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies',
'Contract', 'PaperlessBilling', 'PaymentMethod', 'Churn'],
dtype='object')
categoricalColumns = findCategoricalData(cdata_copy)
# Print value counts for every categorical column except customerID
for col in cdata_copy[categoricalColumns]:
    if col == 'customerID':
        pass
    else:
        print(cdata_copy[col].value_counts())
Male 3555 Female 3488 Name: gender, dtype: int64 No 3641 Yes 3402 Name: Partner, dtype: int64 No 4933 Yes 2110 Name: Dependents, dtype: int64 Yes 6361 No 682 Name: PhoneService, dtype: int64 No 3390 Yes 2971 No phone service 682 Name: MultipleLines, dtype: int64 Fiber optic 3096 DSL 2421 No 1526 Name: InternetService, dtype: int64 No 3498 Yes 2019 No internet service 1526 Name: OnlineSecurity, dtype: int64 No 3088 Yes 2429 No internet service 1526 Name: OnlineBackup, dtype: int64 No 3095 Yes 2422 No internet service 1526 Name: DeviceProtection, dtype: int64 No 3473 Yes 2044 No internet service 1526 Name: TechSupport, dtype: int64 No 2810 Yes 2707 No internet service 1526 Name: StreamingTV, dtype: int64 No 2785 Yes 2732 No internet service 1526 Name: StreamingMovies, dtype: int64 Month-to-month 3875 Two year 1695 One year 1473 Name: Contract, dtype: int64 Yes 4171 No 2872 Name: PaperlessBilling, dtype: int64 Electronic check 2365 Mailed check 1612 Bank transfer (automatic) 1544 Credit card (automatic) 1522 Name: PaymentMethod, dtype: int64 No 5174 Yes 1869 Name: Churn, dtype: int64
cdata_copy.head()
| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.50 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
5 rows × 21 columns
replaceStruct = {
"gender": {"Male": 0, "Female": 1 },
"Partner": {"No": 0, "Yes": 1 },
"Dependents": {"No": 0, "Yes": 1 },
"PhoneService": {"No": 0, "Yes": 1 },
"MultipleLines": {"No": 0, "Yes": 1, "No phone service" : 2 },
"InternetService": {"Fiber optic": 1, "DSL": 2, "No" :0},
"OnlineSecurity": {"No": 0, "Yes": 1, "No internet service" : 2 },
"OnlineBackup": {"No": 0, "Yes": 1, "No internet service" : 2 },
"OnlineSecurity": {"No": 0, "Yes": 1, "No internet service" : 2 },
"DeviceProtection": {"No": 0, "Yes": 1, "No internet service" : 2 },
"TechSupport": {"No": 0, "Yes": 1, "No internet service" : 2 },
"StreamingTV": {"No": 0, "Yes": 1, "No internet service" : 2 },
"StreamingMovies": {"No": 0, "Yes": 1, "No internet service" : 2 },
"Contract": {"Month-to-month": 1, "One year": 2,"Two year": 3 },
"PaperlessBilling": {"No": 0, "Yes": 1 },
"Churn": {"No": 0, "Yes": 1 }
}
oneHotCols=["PaymentMethod"]
encoded_data=cdata_copy.replace(replaceStruct)
encoded_data=pd.get_dummies(encoded_data, columns=oneHotCols)
encoded_data.head(10)
| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | StreamingMovies | Contract | PaperlessBilling | MonthlyCharges | TotalCharges | Churn | PaymentMethod_Bank transfer (automatic) | PaymentMethod_Credit card (automatic) | PaymentMethod_Electronic check | PaymentMethod_Mailed check | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | 1 | 0 | 1 | 0 | 1 | 0 | 2 | 2 | 0 | ... | 0 | 1 | 1 | 29.85 | 29.85 | 0 | 0 | 0 | 1 | 0 |
| 1 | 5575-GNVDE | 0 | 0 | 0 | 0 | 34 | 1 | 0 | 2 | 1 | ... | 0 | 2 | 0 | 56.95 | 1889.50 | 0 | 0 | 0 | 0 | 1 |
| 2 | 3668-QPYBK | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 2 | 1 | ... | 0 | 1 | 1 | 53.85 | 108.15 | 1 | 0 | 0 | 0 | 1 |
| 3 | 7795-CFOCW | 0 | 0 | 0 | 0 | 45 | 0 | 2 | 2 | 1 | ... | 0 | 2 | 0 | 42.30 | 1840.75 | 0 | 1 | 0 | 0 | 0 |
| 4 | 9237-HQITU | 1 | 0 | 0 | 0 | 2 | 1 | 0 | 1 | 0 | ... | 0 | 1 | 1 | 70.70 | 151.65 | 1 | 0 | 0 | 1 | 0 |
| 5 | 9305-CDSKC | 1 | 0 | 0 | 0 | 8 | 1 | 1 | 1 | 0 | ... | 1 | 1 | 1 | 99.65 | 820.50 | 1 | 0 | 0 | 1 | 0 |
| 6 | 1452-KIOVK | 0 | 0 | 0 | 1 | 22 | 1 | 1 | 1 | 0 | ... | 0 | 1 | 1 | 89.10 | 1949.40 | 0 | 0 | 1 | 0 | 0 |
| 7 | 6713-OKOMC | 1 | 0 | 0 | 0 | 10 | 0 | 2 | 2 | 1 | ... | 0 | 1 | 0 | 29.75 | 301.90 | 0 | 0 | 0 | 0 | 1 |
| 8 | 7892-POOKP | 1 | 0 | 1 | 0 | 28 | 1 | 1 | 1 | 0 | ... | 1 | 1 | 1 | 104.80 | 3046.05 | 1 | 0 | 0 | 1 | 0 |
| 9 | 6388-TABGU | 0 | 0 | 0 | 1 | 62 | 1 | 0 | 2 | 1 | ... | 0 | 2 | 0 | 56.15 | 3487.95 | 0 | 1 | 0 | 0 | 0 |
10 rows × 24 columns
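The combination of ordinal replacement (for ordered features such as `Contract`) and one-hot encoding (for the unordered `PaymentMethod`) can be sanity-checked on a two-row toy frame (a sketch; the rows are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Two year"],
    "PaymentMethod": ["Electronic check", "Mailed check"],
})
# Ordinal mapping for the ordered feature, one-hot for the unordered one
toy = toy.replace({"Contract": {"Month-to-month": 1, "One year": 2, "Two year": 3}})
toy = pd.get_dummies(toy, columns=["PaymentMethod"])
print(toy.columns.tolist())
# → ['Contract', 'PaymentMethod_Electronic check', 'PaymentMethod_Mailed check']
```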
encoded_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7043 entries, 0 to 7042
Data columns (total 24 columns):
 #   Column                                   Non-Null Count  Dtype
---  ------                                   --------------  -----
 0   customerID                               7043 non-null   category
 1   gender                                   7043 non-null   int64
 2   SeniorCitizen                            7043 non-null   int64
 3   Partner                                  7043 non-null   int64
 4   Dependents                               7043 non-null   int64
 5   tenure                                   7043 non-null   int64
 6   PhoneService                             7043 non-null   int64
 7   MultipleLines                            7043 non-null   int64
 8   InternetService                          7043 non-null   int64
 9   OnlineSecurity                           7043 non-null   int64
 10  OnlineBackup                             7043 non-null   int64
 11  DeviceProtection                         7043 non-null   int64
 12  TechSupport                              7043 non-null   int64
 13  StreamingTV                              7043 non-null   int64
 14  StreamingMovies                          7043 non-null   int64
 15  Contract                                 7043 non-null   int64
 16  PaperlessBilling                         7043 non-null   int64
 17  MonthlyCharges                           7043 non-null   float64
 18  TotalCharges                             7043 non-null   float64
 19  Churn                                    7043 non-null   int64
 20  PaymentMethod_Bank transfer (automatic)  7043 non-null   uint8
 21  PaymentMethod_Credit card (automatic)    7043 non-null   uint8
 22  PaymentMethod_Electronic check           7043 non-null   uint8
 23  PaymentMethod_Mailed check               7043 non-null   uint8
dtypes: category(1), float64(2), int64(17), uint8(4)
memory usage: 1.4 MB
sns.pairplot(encoded_data, diag_kind= 'kde')
plt.show()
#Understand the Target variable distribution
encoded_data['Churn'].value_counts()
0    5174
1    1869
Name: Churn, dtype: int64
# count plot on single categorical variable
sns.countplot(x ='Churn', data = encoded_data)
plt.title("Count Plot for the Target variable")
# Show the plot
plt.show()
# Pie chart of the target variable distribution
colors = sns.color_palette('pastel')[0:2]
plt.title("Pie chart for the Target variable")
encoded_data['Churn'].value_counts().plot.pie(autopct='%1.1f%%', shadow=True, colors=colors)
plt.show()
corr =encoded_data.corr()
print(corr)
gender SeniorCitizen Partner \
gender 1.000000 0.001874 0.001808
SeniorCitizen 0.001874 1.000000 0.016479
Partner 0.001808 0.016479 1.000000
Dependents -0.010517 -0.211185 0.452676
tenure -0.005106 0.016567 0.379697
PhoneService 0.006488 0.008576 0.017706
MultipleLines 0.000485 0.099883 0.090981
InternetService -0.000863 0.032310 -0.000891
OnlineSecurity 0.003429 -0.210897 0.081850
OnlineBackup 0.002032 -0.152780 0.087055
DeviceProtection -0.005092 -0.157095 0.094451
TechSupport -0.000985 -0.223770 0.069072
StreamingTV -0.001156 -0.130130 0.080127
StreamingMovies 0.000191 -0.120802 0.075779
Contract -0.000126 -0.142554 0.294806
PaperlessBilling 0.011754 0.156530 -0.014877
MonthlyCharges 0.014569 0.220173 0.096848
TotalCharges 0.000087 0.102997 0.317532
Churn 0.008612 0.150889 -0.150448
PaymentMethod_Bank transfer (automatic) 0.016024 -0.016159 0.110706
PaymentMethod_Credit card (automatic) -0.001215 -0.024135 0.082029
PaymentMethod_Electronic check -0.000752 0.171718 -0.083852
PaymentMethod_Mailed check -0.013744 -0.153477 -0.095125
Dependents tenure PhoneService \
gender -0.010517 -0.005106 0.006488
SeniorCitizen -0.211185 0.016567 0.008576
Partner 0.452676 0.379697 0.017706
Dependents 1.000000 0.159712 -0.001762
tenure 0.159712 1.000000 0.008448
PhoneService -0.001762 0.008448 1.000000
MultipleLines -0.016875 0.242279 -0.691070
InternetService -0.044590 0.030359 -0.387436
OnlineSecurity 0.190523 0.145298 0.125353
OnlineBackup 0.162445 0.178651 0.150338
DeviceProtection 0.156439 0.178649 0.138755
TechSupport 0.180832 0.144459 0.123350
StreamingTV 0.140395 0.136145 0.171538
StreamingMovies 0.125820 0.140781 0.165205
Contract 0.243187 0.671607 0.002247
PaperlessBilling -0.111377 0.006152 0.016505
MonthlyCharges -0.113890 0.247900 0.247398
TotalCharges 0.062124 0.826164 0.113203
Churn -0.164221 -0.352229 0.011942
PaymentMethod_Bank transfer (automatic) 0.052021 0.243510 0.007556
PaymentMethod_Credit card (automatic) 0.060267 0.233006 -0.007721
PaymentMethod_Electronic check -0.150642 -0.208363 0.003062
PaymentMethod_Mailed check 0.059071 -0.233852 -0.003319
MultipleLines InternetService \
gender 0.000485 -0.000863
SeniorCitizen 0.099883 0.032310
Partner 0.090981 -0.000891
Dependents -0.016875 -0.044590
tenure 0.242279 0.030359
PhoneService -0.691070 -0.387436
MultipleLines 1.000000 0.340949
InternetService 0.340949 1.000000
OnlineSecurity -0.235021 -0.607788
OnlineBackup -0.210372 -0.658287
DeviceProtection -0.200463 -0.662957
TechSupport -0.232155 -0.609795
StreamingTV -0.202414 -0.712890
StreamingMovies -0.195815 -0.709020
Contract 0.078613 -0.099721
PaperlessBilling 0.108230 0.138625
MonthlyCharges 0.146153 0.323260
TotalCharges 0.250647 0.175771
Churn 0.019423 0.047291
PaymentMethod_Bank transfer (automatic) 0.050046 0.017581
PaymentMethod_Credit card (automatic) 0.052168 0.032540
PaymentMethod_Electronic check 0.060190 0.091881
PaymentMethod_Mailed check -0.168056 -0.152481
OnlineSecurity OnlineBackup ... \
gender 0.003429 0.002032 ...
SeniorCitizen -0.210897 -0.152780 ...
Partner 0.081850 0.087055 ...
Dependents 0.190523 0.162445 ...
tenure 0.145298 0.178651 ...
PhoneService 0.125353 0.150338 ...
MultipleLines -0.235021 -0.210372 ...
InternetService -0.607788 -0.658287 ...
OnlineSecurity 1.000000 0.751661 ...
OnlineBackup 0.751661 1.000000 ...
DeviceProtection 0.749040 0.740604 ...
TechSupport 0.791225 0.754095 ...
StreamingTV 0.701976 0.720671 ...
StreamingMovies 0.704984 0.716700 ...
Contract 0.389978 0.351267 ...
PaperlessBilling -0.334003 -0.262402 ...
MonthlyCharges -0.621227 -0.538454 ...
TotalCharges -0.154370 -0.086208 ...
Churn -0.332819 -0.291449 ...
PaymentMethod_Bank transfer (automatic) 0.051817 0.050891 ...
PaymentMethod_Credit card (automatic) 0.066737 0.056526 ...
PaymentMethod_Electronic check -0.358367 -0.301832 ...
PaymentMethod_Mailed check 0.286446 0.233808 ...
StreamingMovies Contract \
gender 0.000191 -0.000126
SeniorCitizen -0.120802 -0.142554
Partner 0.075779 0.294806
Dependents 0.125820 0.243187
tenure 0.140781 0.671607
PhoneService 0.165205 0.002247
MultipleLines -0.195815 0.078613
InternetService -0.709020 -0.099721
OnlineSecurity 0.704984 0.389978
OnlineBackup 0.716700 0.351267
DeviceProtection 0.766821 0.390216
TechSupport 0.737123 0.418440
StreamingTV 0.809608 0.327951
StreamingMovies 1.000000 0.330993
Contract 0.330993 1.000000
PaperlessBilling -0.211818 -0.176733
MonthlyCharges -0.424598 -0.074195
TotalCharges -0.073165 0.446911
Churn -0.207256 -0.396713
PaymentMethod_Bank transfer (automatic) 0.028839 0.186440
PaymentMethod_Credit card (automatic) 0.032189 0.210659
PaymentMethod_Electronic check -0.219951 -0.342575
PaymentMethod_Mailed check 0.187321 -0.004882
PaperlessBilling MonthlyCharges \
gender 0.011754 0.014569
SeniorCitizen 0.156530 0.220173
Partner -0.014877 0.096848
Dependents -0.111377 -0.113890
tenure 0.006152 0.247900
PhoneService 0.016505 0.247398
MultipleLines 0.108230 0.146153
InternetService 0.138625 0.323260
OnlineSecurity -0.334003 -0.621227
OnlineBackup -0.262402 -0.538454
DeviceProtection -0.276326 -0.513440
TechSupport -0.310749 -0.597594
StreamingTV -0.203907 -0.423067
StreamingMovies -0.211818 -0.424598
Contract -0.176733 -0.074195
PaperlessBilling 1.000000 0.352150
MonthlyCharges 0.352150 1.000000
TotalCharges 0.158562 0.651182
Churn 0.191825 0.193356
PaymentMethod_Bank transfer (automatic) -0.016332 0.042812
PaymentMethod_Credit card (automatic) -0.013589 0.030550
PaymentMethod_Electronic check 0.208865 0.271625
PaymentMethod_Mailed check -0.205398 -0.377437
TotalCharges Churn \
gender 0.000087 0.008612
SeniorCitizen 0.102997 0.150889
Partner 0.317532 -0.150448
Dependents 0.062124 -0.164221
tenure 0.826164 -0.352229
PhoneService 0.113203 0.011942
MultipleLines 0.250647 0.019423
InternetService 0.175771 0.047291
OnlineSecurity -0.154370 -0.332819
OnlineBackup -0.086208 -0.291449
DeviceProtection -0.078601 -0.281465
TechSupport -0.142167 -0.329852
StreamingTV -0.076855 -0.205742
StreamingMovies -0.073165 -0.207256
Contract 0.446911 -0.396713
PaperlessBilling 0.158562 0.191825
MonthlyCharges 0.651182 0.193356
TotalCharges 1.000000 -0.198347
Churn -0.198347 1.000000
PaymentMethod_Bank transfer (automatic) 0.185994 -0.117937
PaymentMethod_Credit card (automatic) 0.182913 -0.134302
PaymentMethod_Electronic check -0.059268 0.301919
PaymentMethod_Mailed check -0.295740 -0.091683
PaymentMethod_Bank transfer (automatic) \
gender 0.016024
SeniorCitizen -0.016159
Partner 0.110706
Dependents 0.052021
tenure 0.243510
PhoneService 0.007556
MultipleLines 0.050046
InternetService 0.017581
OnlineSecurity 0.051817
OnlineBackup 0.050891
DeviceProtection 0.048459
TechSupport 0.055556
StreamingTV 0.027200
StreamingMovies 0.028839
Contract 0.186440
PaperlessBilling -0.016332
MonthlyCharges 0.042812
TotalCharges 0.185994
Churn -0.117937
PaymentMethod_Bank transfer (automatic) 1.000000
PaymentMethod_Credit card (automatic) -0.278215
PaymentMethod_Electronic check -0.376762
PaymentMethod_Mailed check -0.288685
PaymentMethod_Credit card (automatic) \
gender -0.001215
SeniorCitizen -0.024135
Partner 0.082029
Dependents 0.060267
tenure 0.233006
PhoneService -0.007721
MultipleLines 0.052168
InternetService 0.032540
OnlineSecurity 0.066737
OnlineBackup 0.056526
DeviceProtection 0.069131
TechSupport 0.067946
StreamingTV 0.026884
StreamingMovies 0.032189
Contract 0.210659
PaperlessBilling -0.013589
MonthlyCharges 0.030550
TotalCharges 0.182913
Churn -0.134302
PaymentMethod_Bank transfer (automatic) -0.278215
PaymentMethod_Credit card (automatic) 1.000000
PaymentMethod_Electronic check -0.373322
PaymentMethod_Mailed check -0.286049
PaymentMethod_Electronic check \
gender -0.000752
SeniorCitizen 0.171718
Partner -0.083852
Dependents -0.150642
tenure -0.208363
PhoneService 0.003062
MultipleLines 0.060190
InternetService 0.091881
OnlineSecurity -0.358367
OnlineBackup -0.301832
DeviceProtection -0.303490
TechSupport -0.360472
StreamingTV -0.215427
StreamingMovies -0.219951
Contract -0.342575
PaperlessBilling 0.208865
MonthlyCharges 0.271625
TotalCharges -0.059268
Churn 0.301919
PaymentMethod_Bank transfer (automatic) -0.376762
PaymentMethod_Credit card (automatic) -0.373322
PaymentMethod_Electronic check 1.000000
PaymentMethod_Mailed check -0.387372
PaymentMethod_Mailed check
gender -0.013744
SeniorCitizen -0.153477
Partner -0.095125
Dependents 0.059071
tenure -0.233852
PhoneService -0.003319
MultipleLines -0.168056
InternetService -0.152481
OnlineSecurity 0.286446
OnlineBackup 0.233808
DeviceProtection 0.225717
TechSupport 0.283947
StreamingTV 0.189047
StreamingMovies 0.187321
Contract -0.004882
PaperlessBilling -0.205398
MonthlyCharges -0.377437
TotalCharges -0.295740
Churn -0.091683
PaymentMethod_Bank transfer (automatic) -0.288685
PaymentMethod_Credit card (automatic) -0.286049
PaymentMethod_Electronic check -0.387372
PaymentMethod_Mailed check 1.000000
[23 rows x 23 columns]
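The wall of numbers above is easier to digest by taking the `Churn` column of the correlation matrix and sorting it by magnitude. A sketch on a toy frame (in the notebook, `corr["Churn"]` can be used directly; the toy values here are made up):

```python
import pandas as pd

# Toy frame mimicking a few encoded columns
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 2, 8],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 70.70, 99.65],
    "Churn": [0, 0, 1, 0, 1, 1],
})
corr = df.corr()
# Sort the churn correlations by absolute value to surface the strongest drivers
drivers = corr["Churn"].drop("Churn").abs().sort_values(ascending=False)
print(drivers.index.tolist())
```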
fig, ax = plt.subplots(figsize=(10,10))
#sns.heatmap(df2.corr(), center=0, cmap='BrBG', annot=True)
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(encoded_data.corr(), center=0, cmap='mako', annot=True, fmt='.2f', linewidths=0.05, mask=mask)
ax.set_title('Heat map of customer data for Churn')
sns.pairplot(encoded_data, hue = 'Churn', diag_kind = 'kde',
plot_kws = {'alpha': 0.6, 's': 80, 'edgecolor': 'k'}
)
# Title
plt.suptitle('Pair Plot of CustomerData based on the Churn',
size = 24);
F. Split the data into 80% train and 20% test. [1 Mark]
encoded_data.columns
Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
'MonthlyCharges', 'TotalCharges', 'Churn',
'PaymentMethod_Bank transfer (automatic)',
'PaymentMethod_Credit card (automatic)',
'PaymentMethod_Electronic check', 'PaymentMethod_Mailed check'],
dtype='object')
# drop "CustomerID" a along with the target column
X = encoded_data.drop(columns= ['customerID','Churn'])
y = encoded_data.Churn
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=10)
X_train.shape
(5634, 22)
X_test.shape
(1409, 22)
encoded_data.head()
| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | StreamingMovies | Contract | PaperlessBilling | MonthlyCharges | TotalCharges | Churn | PaymentMethod_Bank transfer (automatic) | PaymentMethod_Credit card (automatic) | PaymentMethod_Electronic check | PaymentMethod_Mailed check | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | 1 | 0 | 1 | 0 | 1 | 0 | 2 | 2 | 0 | ... | 0 | 1 | 1 | 29.85 | 29.85 | 0 | 0 | 0 | 1 | 0 |
| 1 | 5575-GNVDE | 0 | 0 | 0 | 0 | 34 | 1 | 0 | 2 | 1 | ... | 0 | 2 | 0 | 56.95 | 1889.50 | 0 | 0 | 0 | 0 | 1 |
| 2 | 3668-QPYBK | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 2 | 1 | ... | 0 | 1 | 1 | 53.85 | 108.15 | 1 | 0 | 0 | 0 | 1 |
| 3 | 7795-CFOCW | 0 | 0 | 0 | 0 | 45 | 0 | 2 | 2 | 1 | ... | 0 | 2 | 0 | 42.30 | 1840.75 | 0 | 1 | 0 | 0 | 0 |
| 4 | 9237-HQITU | 1 | 0 | 0 | 0 | 2 | 1 | 0 | 1 | 0 | ... | 0 | 1 | 1 | 70.70 | 151.65 | 1 | 0 | 0 | 1 | 0 |
5 rows × 24 columns
encoded_data.describe()
| gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | ... | StreamingMovies | Contract | PaperlessBilling | MonthlyCharges | TotalCharges | Churn | PaymentMethod_Bank transfer (automatic) | PaymentMethod_Credit card (automatic) | PaymentMethod_Electronic check | PaymentMethod_Mailed check | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7043.000000 | 7043.000000 | 7043.000000 | 7043.000000 | 7043.000000 | 7043.000000 | 7043.000000 | 7043.000000 | 7043.000000 | 7043.000000 | ... | 7043.000000 | 7043.000000 | 7043.000000 | 7043.000000 | 7043.000000 | 7043.000000 | 7043.000000 | 7043.000000 | 7043.000000 | 7043.000000 |
| mean | 0.495244 | 0.162147 | 0.483033 | 0.299588 | 32.371149 | 0.903166 | 0.615505 | 1.127077 | 0.720006 | 0.778220 | ... | 0.821241 | 1.690473 | 0.592219 | 64.761692 | 2279.798992 | 0.265370 | 0.219225 | 0.216101 | 0.335794 | 0.228880 |
| std | 0.500013 | 0.368612 | 0.499748 | 0.458110 | 24.559481 | 0.295752 | 0.656039 | 0.737796 | 0.796885 | 0.778472 | ... | 0.761725 | 0.833755 | 0.491457 | 30.090047 | 2266.730170 | 0.441561 | 0.413751 | 0.411613 | 0.472301 | 0.420141 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 1.000000 | 0.000000 | 18.250000 | 18.800000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 9.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 1.000000 | 0.000000 | 35.500000 | 398.550000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 29.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 70.350000 | 1394.550000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 55.000000 | 1.000000 | 1.000000 | 2.000000 | 1.000000 | 1.000000 | ... | 1.000000 | 2.000000 | 1.000000 | 89.850000 | 3786.600000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
| max | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 72.000000 | 1.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | ... | 2.000000 | 3.000000 | 1.000000 | 118.750000 | 8684.800000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
8 rows × 23 columns
y_train.head()
1182    0
4328    0
6091    1
4870    0
4683    0
Name: Churn, dtype: int64
G. Normalize/Standardize the data with the best suitable approach. [2 Marks]
# Convert the features into z-scores, since we do not know what units / scales were used, and store them in a new DataFrame.
# It is always advised to scale numeric attributes for models that calculate distances (tree-based models such as XGBoost are scale-invariant, but scaling does no harm here).
from scipy.stats import zscore
XScaled = X.apply(zscore) # convert all attributes to Z scale
XScaled.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| gender | 7043.0 | -1.387818e-16 | 1.000071 | -0.990532 | -0.990532 | -0.990532 | 1.009559 | 1.009559 |
| SeniorCitizen | 7043.0 | 6.417792e-16 | 1.000071 | -0.439916 | -0.439916 | -0.439916 | -0.439916 | 2.273159 |
| Partner | 7043.0 | 5.934956e-17 | 1.000071 | -0.966622 | -0.966622 | -0.966622 | 1.034530 | 1.034530 |
| Dependents | 7043.0 | -8.310515e-17 | 1.000071 | -0.654012 | -0.654012 | -0.654012 | 1.529024 | 1.529024 |
| tenure | 7043.0 | 5.945991e-17 | 1.000071 | -1.318165 | -0.951682 | -0.137274 | 0.921455 | 1.613701 |
| PhoneService | 7043.0 | -2.087449e-15 | 1.000071 | -3.054010 | 0.327438 | 0.327438 | 0.327438 | 0.327438 |
| MultipleLines | 7043.0 | 1.092489e-15 | 1.000071 | -0.938280 | -0.938280 | 0.586128 | 0.586128 | 2.110535 |
| InternetService | 7043.0 | 2.733942e-16 | 1.000071 | -1.527734 | -0.172250 | -0.172250 | 1.183234 | 1.183234 |
| OnlineSecurity | 7043.0 | -3.673762e-16 | 1.000071 | -0.903589 | -0.903589 | 0.351386 | 0.351386 | 1.606361 |
| OnlineBackup | 7043.0 | 5.502406e-16 | 1.000071 | -0.999747 | -0.999747 | 0.284912 | 0.284912 | 1.569572 |
| DeviceProtection | 7043.0 | 8.414081e-16 | 1.000071 | -0.998016 | -0.998016 | 0.286059 | 0.286059 | 1.570134 |
| TechSupport | 7043.0 | 1.435109e-16 | 1.000071 | -0.909172 | -0.909172 | 0.347362 | 0.347362 | 1.603896 |
| StreamingTV | 7043.0 | -2.520780e-16 | 1.000071 | -1.071457 | -1.071457 | 0.238887 | 0.238887 | 1.549232 |
| StreamingMovies | 7043.0 | -8.764346e-16 | 1.000071 | -1.078210 | -1.078210 | 0.234693 | 0.234693 | 1.547597 |
| Contract | 7043.0 | -1.072864e-16 | 1.000071 | -0.828207 | -0.828207 | -0.828207 | 0.371271 | 1.570749 |
| PaperlessBilling | 7043.0 | -5.924868e-16 | 1.000071 | -1.205113 | -1.205113 | 0.829798 | 0.829798 | 0.829798 |
| MonthlyCharges | 7043.0 | -8.291599e-17 | 1.000071 | -1.545860 | -0.972540 | 0.185733 | 0.833833 | 1.794352 |
| TotalCharges | 7043.0 | -1.269277e-16 | 1.000071 | -0.997542 | -0.829998 | -0.390568 | 0.664794 | 2.825857 |
| PaymentMethod_Bank transfer (automatic) | 7043.0 | -6.454206e-16 | 1.000071 | -0.529885 | -0.529885 | -0.529885 | -0.529885 | 1.887201 |
| PaymentMethod_Credit card (automatic) | 7043.0 | 2.888188e-16 | 1.000071 | -0.525047 | -0.525047 | -0.525047 | -0.525047 | 1.904590 |
| PaymentMethod_Electronic check | 7043.0 | 4.010233e-16 | 1.000071 | -0.711026 | -0.711026 | -0.711026 | 1.406418 | 1.406418 |
| PaymentMethod_Mailed check | 7043.0 | -8.683795e-16 | 1.000071 | -0.544807 | -0.544807 | -0.544807 | -0.544807 | 1.835513 |
#Split with ScaledX
X_train, X_test, y_train, y_test = train_test_split(XScaled, y, test_size=0.20, random_state=10)
3. Model building and Improvement: [10 Marks]
A. Train a model using XGBoost. Also print best performing parameters along with train and test performance. [5 Marks]
import xgboost as xgb
from xgboost import XGBClassifier

# Fit a baseline model on the training data
#model = XGBClassifier(n_estimators=10, seed=123, enable_categorical=True, tree_method='gpu_hist')
model = XGBClassifier()
print(model.get_params)  # without parentheses this prints the bound method; call get_params() for the parameter dict
model.fit(X_train, y_train)
#print(model)
<bound method XGBModel.get_params of XGBClassifier(base_score=None, booster=None, colsample_bylevel=None,
colsample_bynode=None, colsample_bytree=None,
enable_categorical=False, gamma=None, gpu_id=None,
importance_type=None, interaction_constraints=None,
learning_rate=None, max_delta_step=None, max_depth=None,
min_child_weight=None, missing=nan, monotone_constraints=None,
n_estimators=100, n_jobs=None, num_parallel_tree=None,
predictor=None, random_state=None, reg_alpha=None,
reg_lambda=None, scale_pos_weight=None, subsample=None,
tree_method=None, validate_parameters=None, verbosity=None)>
[20:58:05] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
gamma=0, gpu_id=-1, importance_type=None,
interaction_constraints='', learning_rate=0.300000012,
max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=100, n_jobs=8,
num_parallel_tree=1, predictor='auto', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
tree_method='exact', validate_parameters=1, verbosity=None)
# Predict the labels of the test set: preds
preds = model.predict(X_test)
# Compute the accuracy: accuracy
accuracy = float(np.sum(preds==y_test))/y_test.shape[0]
print("accuracy: %f" % (accuracy))
accuracy: 0.797729
from datetime import datetime

def timer(start_time=None):
    if not start_time:
        start_time = datetime.now()
        return start_time
    elif start_time:
        thour, temp_sec = divmod((datetime.now() - start_time).total_seconds(), 3600)
        tmin, tsec = divmod(temp_sec, 60)
        print('\n Time taken: %i hours %i minutes and %s seconds.' % (thour, tmin, round(tsec, 2)))
B. Improve performance of the XGBoost as much as possible. Also print best performing parameters along with train and test performance. [5 Marks]
In k-fold cross-validation, every entry in the original training dataset is used for both training and validation, and each entry is used for validation exactly once. XGBoost supports k-fold cross-validation via its cv() method: specify the nfold parameter, the number of cross-validation folds to build. It also supports the following parameters:
num_boost_round: denotes the number of trees you build (analogous to n_estimators)
metrics: tells the evaluation metrics to be watched during CV
as_pandas: to return the results in a pandas DataFrame.
early_stopping_rounds: finishes training of the model early if the hold-out metric ("rmse" in our case) does not improve for a given number of rounds.
seed: for reproducibility of results.
params = {"objective":"reg:squarederror",
'colsample_bytree': 0.3,
'learning_rate': 0.1,
'max_depth': 5,
'alpha': 10}
data_dmatrix = xgb.DMatrix(data=X,label=y)
cv_results = xgb.cv(dtrain=data_dmatrix, params=params, nfold=3,
num_boost_round=50,early_stopping_rounds=10,metrics="rmse", as_pandas=True, seed=123)
cv_results.head()
| train-rmse-mean | train-rmse-std | test-rmse-mean | test-rmse-std | |
|---|---|---|---|---|
| 0 | 0.484554 | 0.000296 | 0.484874 | 0.000339 |
| 1 | 0.468370 | 0.001197 | 0.468903 | 0.001644 |
| 2 | 0.453211 | 0.000936 | 0.454046 | 0.001924 |
| 3 | 0.443123 | 0.000439 | 0.444213 | 0.001817 |
| 4 | 0.433874 | 0.001258 | 0.435111 | 0.002731 |
print((cv_results["test-rmse-mean"]).tail(1))
49 0.368221 Name: test-rmse-mean, dtype: float64
## Hyperparameter Optimization
params={
"learning_rate" : [0.05, 0.10, 0.15, 0.20, 0.25, 0.30 ] ,
"max_depth" : [ 3, 4, 5, 6, 8, 10, 12, 15],
"min_child_weight" : [ 1, 3, 5, 7 ],
"gamma" : [ 0.0, 0.1, 0.2 , 0.3, 0.4 ],
"colsample_bytree" : [ 0.3, 0.4, 0.5 , 0.7 ],
"eval_metric" :['mlogloss','error']
}
## Hyperparameter optimization using RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
classifier=xgb.XGBClassifier()
random_search=RandomizedSearchCV(classifier,param_distributions=params,n_iter=5,scoring='roc_auc',n_jobs=-1,cv=5,verbose=3)
from datetime import datetime
# Here we go
start_time = timer(None) # timing starts from this point for "start_time" variable
#random_search.fit(X,y)
random_search.fit(X_train, y_train)
timer(start_time)
Fitting 5 folds for each of 5 candidates, totalling 25 fits

 Time taken: 0 hours 0 minutes and 10.65 seconds.
X.head()
| gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | ... | StreamingTV | StreamingMovies | Contract | PaperlessBilling | MonthlyCharges | TotalCharges | PaymentMethod_Bank transfer (automatic) | PaymentMethod_Credit card (automatic) | PaymentMethod_Electronic check | PaymentMethod_Mailed check | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 1 | 0 | 2 | 2 | 0 | 1 | ... | 0 | 0 | 1 | 1 | 29.85 | 29.85 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 0 | 34 | 1 | 0 | 2 | 1 | 0 | ... | 0 | 0 | 2 | 0 | 56.95 | 1889.50 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 2 | 1 | 1 | ... | 0 | 0 | 1 | 1 | 53.85 | 108.15 | 0 | 0 | 0 | 1 |
| 3 | 0 | 0 | 0 | 0 | 45 | 0 | 2 | 2 | 1 | 0 | ... | 0 | 0 | 2 | 0 | 42.30 | 1840.75 | 1 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 2 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 70.70 | 151.65 | 0 | 0 | 1 | 0 |
5 rows × 22 columns
random_search.best_estimator_
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=0.5,
enable_categorical=False, eval_metric='mlogloss', gamma=0.0,
gpu_id=-1, importance_type=None, interaction_constraints='',
learning_rate=0.05, max_delta_step=0, max_depth=5,
min_child_weight=3, missing=nan, monotone_constraints='()',
n_estimators=100, n_jobs=8, num_parallel_tree=1, predictor='auto',
random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
subsample=1, tree_method='exact', validate_parameters=1,
verbosity=None)
random_search.best_params_
{'min_child_weight': 3,
'max_depth': 5,
'learning_rate': 0.05,
'gamma': 0.0,
'eval_metric': 'mlogloss',
'colsample_bytree': 0.5}
# Refit with tuned hyperparameters (note: these values differ from the best_params_ printed above)
classifier=xgb.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=0.3,
enable_categorical=False, eval_metric='error', gamma=0.1,
gpu_id=-1, importance_type=None, interaction_constraints='',
learning_rate=0.25, max_delta_step=0, max_depth=4,
min_child_weight=7, monotone_constraints='()',
n_estimators=100, n_jobs=8, num_parallel_tree=1, predictor='auto',
random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
subsample=1, tree_method='exact', validate_parameters=1,
verbosity=None)
classifier.fit(X_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=0.3,
enable_categorical=False, eval_metric='error', gamma=0.1,
gpu_id=-1, importance_type=None, interaction_constraints='',
learning_rate=0.25, max_delta_step=0, max_depth=4,
min_child_weight=7, missing=nan, monotone_constraints='()',
n_estimators=100, n_jobs=8, num_parallel_tree=1, predictor='auto',
random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
subsample=1, tree_method='exact', validate_parameters=1,
verbosity=None)
# Predict the labels of the test set: preds
preds = classifier.predict(X_test)
accuracy = float(np.sum(preds==y_test))/y_test.shape[0]
print("accuracy: %f" % (accuracy))
accuracy: 0.806246
accuracyScore=accuracy_score(y_test, preds)
print(accuracyScore)
0.8062455642299503
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, preds)
array([[960, 106],
[167, 176]], dtype=int64)
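Accuracy alone hides the class imbalance; precision and recall for the churn class can be read straight off the confusion matrix above:

```python
import numpy as np

# Confusion matrix from the model above: rows = actual (No churn, Churn),
# columns = predicted (No churn, Churn)
cm = np.array([[960, 106],
               [167, 176]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # of the predicted churners, how many actually churned
recall = tp / (tp + fn)     # of the actual churners, how many were caught
print(round(precision, 3), round(recall, 3))  # → 0.624 0.513
```

A recall of ~0.51 means nearly half of the churners are missed, which matters more for a retention campaign than raw accuracy.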
# calculate accuracy measures and confusion matrix
cm=metrics.confusion_matrix(y_test, preds, labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["No churn","Churn"]],
columns = [i for i in ["Predict No Churn","Predict Churn"]])
plt.figure(figsize = (6,6))
sns.heatmap(df_cm, annot=True,linewidths=2, linecolor='white', fmt='g', annot_kws={"size":15})
<AxesSubplot:>
from sklearn.model_selection import cross_val_score
score=cross_val_score(classifier,X_train,y_train,cv=10)
score
array([0.80319149, 0.81737589, 0.78191489, 0.76241135, 0.80461812,
0.79218472, 0.80639432, 0.78685613, 0.81172291, 0.79218472])
score.mean()
0.7958854540644723
PART B
• DOMAIN: IT
• CONTEXT: The purpose is to build a machine learning pipeline that works autonomously irrespective of the data, saving users the effort
involved in building pipelines for each dataset.
• PROJECT OBJECTIVE: Build a machine learning pipeline that will run autonomously with the csv file and return best performing model.
• STEPS AND TASK [30 Marks]:
Purpose
Approach: build a pipeline that takes a data file and returns the best performing model.
Approach Description:
The main code passes the data file to the subclasses. Each subclass is modularized to carry out one of the following activities:
Modularization:
1. Data Analysis
2. Data Transformation
3. Data Visualization
4. Split Test and Train
5. Model Building and returning the best model
Import the required Libraries
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score, confusion_matrix
import warnings
warnings.filterwarnings("ignore")
Preprocessor - this class is responsible for all the preprocessing before model building. It uses the classes below:
1. DataAnalyser - to analyse the data
2. DataVisualizer - to visualize the data in graphs and charts
3. DataTransformer - to do the clean-up, imputation, etc., and produce the X and y sets required for model building
Preprocessor is the main class that uses all three classes above to do the necessary preprocessing
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
# Import the SVM model
from sklearn import svm
import warnings
warnings.filterwarnings("ignore")
## method to build data frame from csv file
def buildDataframeFromCSV(csvFileName):
data= pd.read_csv(csvFileName)
return data
This includes
1. Checking the shape and the 5-point summary<br>
2. Verifying null values<br>
3. Finding categorical columns<br>
4. A method to convert object columns to numeric<br>
5. Finding the index/identifier column, which is expected to be dropped<br>
6. Finding numerical columns stored as object data type<br>
class DataAnalyser:
"""
Class used ot Analyse the Data
...
Attributes
----------
data : data
The name of the data frame for which analysis will Happen
Methods
-------
analyseShape()
Prints the shaope of the data, number of row and columns
findIfNullColumns()
Finds the percentage of null columns and prints it
checkDataTypes()
prints the dataTypes of each column
check5PointSummary()
prints 5 pointSummary
findCategoricalData()
Finds the categorical columns in a dataset
printCategoricalData
Print categorical columns in a dataset
findNumericalData
Prints the Numerical Data
convertdataType
Convert the object to Numberical
convertObjectsToCategorical
convert objects to categorical
findColumnsToDrop
Find columns to Drop
findNumericalValuesInObject
Find Numberical Values in object data
"""
#class Constructor
def __init__(self, data ):
self.data=data
## Method to find the shape of data
def analyseShape(self,data):
print("Details of Data features")
print("---------------------------")
print('The number of rows in the data',data.shape[0])
print('The number of features in the data',data.shape[1])
print("Feature Names")
print("----------------")
print(data.columns)
## Method to find the percentage of null columns
def findIfNullColumns(self,data):
print("Percentage of Null Values")
print("--------------------------")
percent_missing = data.isnull().sum() * 100 / len(data)
missing_value_data = pd.DataFrame({'Feature Name': data.columns,
'percent_missing': percent_missing})
print(missing_value_data)
## Method to check the data types of The data frame
def checkDataTypes(self,data):
print("Data types of all the features")
print("-------------------------------")
print(data.dtypes)
## Method to Print the 5 Point Summary
def check5PointSummary(self,data):
print("5 point summary of the data")
print("------------------------------")
print(data.describe().T)
# This function would return all the categorical variables of a dataset
def findCategoricalData(self,data):
categorical_cols=data.select_dtypes(exclude=[np.number]).columns
return categorical_cols
# This function would Print all the categorical variables of a dataset
def printCategoricalData(self,data):
print("Categorical Columns")
print("---------------------")
categorical_cols=data.select_dtypes(exclude=[np.number]).columns
print(categorical_cols)
# This function would return all the Numerical variables of a dataset
def findNumericalData(self,data):
numerical_cols=data.select_dtypes(include=[np.number]).columns
return numerical_cols
# This function would print all the Numerical variables of a dataset
def printNumericalData(self,data):
print("Numerical Columns")
print("---------------------")
numerical_cols=data.select_dtypes(include=[np.number]).columns
print(numerical_cols)
#converting the data type from one data type to another
@staticmethod
def convertdataType(data,columnName,fromType,toType):
# If ‘coerce’, then invalid parsing will be set as NaN.
print('datatype before conversion', data[columnName].dtype)
data[columnName]=pd.to_numeric(data[columnName], errors='coerce')
print('datatype after conversion', data[columnName].dtype)
## Method to convert objects to categorical Values
@staticmethod
def convertObjectsToCategorical(data):
for feature in data.columns: # Loop through all columns in the dataframe
if data[feature].dtype == 'object': # Only apply for columns with categorical strings
data[feature] = pd.Categorical(data[feature])# Replace strings with an integer
return data
## Method to find the column which has to be dropped
## This method will identify the index / identifier values in a dataset which will not be useful for model building
@staticmethod
def findColumnsToDrop(data):
list_of_cols = list(data.select_dtypes(['object']).columns)
print("The features that are not helpful for model building")
print("-----------------------------------------------------")
dropcolumns=[]
for cols in list_of_cols:
# percentage difference between unique count and row count;
# 0 means every value is unique, i.e. an identifier-like column
var = (data[cols].nunique() - len(data[cols]))* 100 / len(data[cols])
if (abs(var) == 0):
dropcolumns=cols
return dropcolumns
## This method finds object columns that actually contain continuous numerical data
@staticmethod
def findNumericalValuesInObject(data):
numericalColumn=[]
for feature in data.columns: # Loop through all columns in the dataframe
if data[feature].dtype == 'object': # Only apply for columns with categorical strings
# data[feature] = pd.Categorical(data[feature])# Replace strings with an integer
#print(data[feature].value_counts())
percentage=data[feature].nunique()/len(data[feature])*100
if(abs(percentage)==100):
print('features having values 100',feature)
elif(abs(percentage)>=90):
print('features more than 90',feature)
numericalColumn=feature
else:
pass
return numericalColumn
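The identifier heuristic in `findColumnsToDrop` can be illustrated with a toy frame (hypothetical values, not the project's data): an object column whose unique count equals the row count behaves like an ID and carries no signal for the model.

```python
import pandas as pd

# Hypothetical toy frame illustrating the heuristic in DataAnalyser.findColumnsToDrop
df = pd.DataFrame({
    'customerID': ['A1', 'B2', 'C3', 'D4'],          # all values unique -> drop candidate
    'Contract': ['Month', 'Month', 'Year', 'Year'],  # repeated values -> keep
})

# An object column with nunique == len(df) is identifier-like
drop_candidates = [c for c in df.select_dtypes('object').columns
                   if df[c].nunique() == len(df)]
print(drop_candidates)
```

On the churn file this flags `customerID`, matching the run output further below.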
class DataTransformer:
"""
Class used to do the required Transformations in data
...
Methods
-------
convertObjectsToCategorical()
Converts the Object datatypes to categorical
dropcolumns()
drops the specified column from the dataset
convertdataTypeToNumeric()
convert a column to Numeric when its object type and contain numeric value
removeNullValues()
Remove null values
"""
## Method to convert objects to categorical Values
# @staticmethod
def convertObjectsToCategorical(self):
for feature in data.columns: # Loop through all columns in the dataframe
if data[feature].dtype == 'object': # Only apply for columns with categorical strings
data[feature] = pd.Categorical(data[feature])# Replace strings with an integer
def dropcolumns(self,cols): # if the required column name is passed that column will be dropped
#df.drop(['C', 'D'], axis = 1)
print('cols to drop=',cols)
if(cols in data.columns):
data.drop(cols, axis = 1, inplace=True)
#converting the data type from object to Number
def convertdataTypeToNumeric(self,columnName):
# If ‘coerce’, then invalid parsing will be set as NaN.
print('datatype before conversion', data[columnName].dtype)
data[columnName]=pd.to_numeric(data[columnName], errors='coerce')
print('datatype after conversion', data[columnName].dtype)
#Remove Null values
def removeNullValues(self):
if(data.isnull().values.any()):
print("Null value found")
data.dropna(how='any', inplace=True)
def encodeCategoricalValues(self):
dataAnalyser = DataAnalyser(data)
categoricalColumns=dataAnalyser.findCategoricalData(data)
for col in data[categoricalColumns]:
data[col]=data[col].cat.codes
return data
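The `cat.codes` encoding used in `encodeCategoricalValues` assigns integer codes in sorted category order; a minimal self-contained sketch (toy series, not the project's columns):

```python
import pandas as pd

# Minimal sketch of the encoding in DataTransformer.encodeCategoricalValues:
# pd.Categorical sorts the unique string values, so 'No' -> 0 and 'Yes' -> 1.
s = pd.Series(['Yes', 'No', 'No', 'Yes'], dtype='category')
codes = s.cat.codes
print(list(codes))
```

Note that the codes are ordinal by alphabet, not by meaning, so for multi-valued columns like `Contract` a tree model handles these codes more gracefully than a linear one.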
class DataVisualizer:
"""
Class used to do Visualization of Data
...
Methods
-------
plotCorrelation()
to plot the correlation values
plotBoxPlotData()
To plot the Box Plot
plotCountPlot()
To plot the count Plot
plotPairPlot()
To plot pair plot
"""
def plotCorrelation(self):
# Correlation matrix for all variables
corr = data.corr()
mask = np.zeros_like(corr, dtype = bool)  # np.bool is deprecated; use the builtin bool
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize = (11, 9))
cmap = sns.diverging_palette(220, 10, as_cmap = True)
#sns.heatmap(data.corr(), center=0, cmap='mako', annot=True, fmt='.2f', linewidths=0.05)
sns.heatmap(corr, mask = mask, cmap = cmap, square = True, linewidths = .5, annot=True, cbar_kws = {"shrink": .5})#, annot = True)
ax.set_title('Correlation Matrix of Data')
def plotBoxPlotData(self):
fig = plt.figure(figsize =(10, 7))
plot_data= data.select_dtypes(exclude=['category','object'])
# Creating plot
plt.boxplot(plot_data)
plt.title('Box Plot for Customer churn')
# show plot
plt.show()
def plotCountPlot(self,targetVariable):
# count plot on single categorical variable
sns.countplot(x =targetVariable, data = data)
plt.title("Count Plot for the Target variable"+targetVariable)
# Show the plot
plt.show()
def plotPairPlot(self,dataName,targetVariable):
sns.pairplot(data, hue = targetVariable, diag_kind = 'kde',
plot_kws = {'alpha': 0.6, 's': 80, 'edgecolor': 'k'}
)
# Title
plt.suptitle('Pair Plot of '+dataName,
size = 24);
def buildPieChart(self):
colors = sns.color_palette('pastel')[0:5]
dataAnalyser = DataAnalyser(data)
categoricalColumns=dataAnalyser.findCategoricalData(data)
print('Pie chart for all categorical Columns')
# print(categoricalColumns)
for col in data[categoricalColumns]:
#n = data[col].nunique()
#activities = np.arange(n)
#slices = np.arange(n)
#patches, texts = plt.pie(slices,colors=colors,startangle=90,labels=slices)
#labels = ['{0} - {1:1.2f} %'.format(i, j) for i, j in zip(activities,100.*slices/slices.sum())]
plt.title("Pie chart for "+ col)
#plt.legend(patches, labels, loc='center left', bbox_to_anchor=(-0.35, .5), fontsize=8)
data[col].value_counts().plot.pie(autopct='%1.2f%%',shadow=True, colors=colors)
plt.show()
class Preprocessor:
"""
Class that invokes the methods of DataAnalyser, DataTransformer and DataVisualizer,
helping to run all of them in sequence
...
Methods
-------
analyseData()
invokes all the methods of DataAnalyser in sequence
visualizeData()
invokes all the methods of DataVisualizer in sequence
transformData()
Invokes the default transformation requirement for a dataset
encodeData()
Transforms all categorical values to numerical for model building
"""
def analyseData(self,data):
dataAnalyser=DataAnalyser(data)
dataAnalyser.findIfNullColumns(data)
dataAnalyser.analyseShape(data)
dataAnalyser.checkDataTypes(data)
dataAnalyser.check5PointSummary(data)
dataAnalyser.printCategoricalData(data)
dataAnalyser.printNumericalData(data)
self.columnsToDrop=dataAnalyser.findColumnsToDrop(data)
self.objectColumnsHaveNumeric=dataAnalyser.findNumericalValuesInObject(data)
print("columnsToDrop",self.columnsToDrop,'\nobjectColumnsHaveNumeric',self.objectColumnsHaveNumeric)
# Method to call all the visualization options
def visualizeData(self):
dataVisualizer = DataVisualizer()
#dataVisualizer.plotCorrelation()
dataVisualizer.plotBoxPlotData()
targetVariable = data.columns[-1]  # the last column is the target
print(targetVariable)
dataVisualizer.plotCountPlot(targetVariable)
dataVisualizer.buildPieChart()
dataVisualizer.plotPairPlot('Customer churn',targetVariable)
# Method to do the necessary conversions, like dropping columns and changing datatypes
def transformData(self,colsToDrop,colsToConvert):
#convertObjectsToCategorical(data)
print("converted objects into category")
#trans_data.dtypes()
dataTransformer = DataTransformer()
dataTransformer.dropcolumns(colsToDrop)
dataTransformer.convertdataTypeToNumeric(colsToConvert)
dataTransformer.convertObjectsToCategorical()
dataTransformer.removeNullValues()
# Method to convert the categorical columns to numerical datatype for model building
def encodeData(self):
dataTransformer = DataTransformer()
dataTransformer.encodeCategoricalValues()
#data.head()
# Split data to X , Y for model building
def splitXY(self):
X = data.iloc[:,:-1]
y = data.iloc[:,-1]
return X,y
#Method to split Train and Test Data
def splitTrainAndTest(self):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=100)
return X_train, X_test, y_train, y_test
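The split above does not stratify on the target, and churn is an imbalanced class (roughly a quarter of customers). A hedged sketch on toy data (not the project's frame) of how `stratify=y` keeps the class ratio equal in train and test:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy imbalanced data standing in for the churn frame (hypothetical proportions)
X_toy, y_toy = make_classification(n_samples=1000, weights=[0.73], random_state=0)

# stratify=y_toy preserves the positive-class share in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=100, stratify=y_toy)
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

Without stratification an unlucky split can leave the test set with a different churn rate, distorting every accuracy comparison downstream.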
def getBestModel(data):
## Pipelines Creation
## 1. Data Preprocessing by using Standard Scaler
## 2. Reduce Dimension using PCA
## 3. Apply Classifier
pipeline_lr=Pipeline([('scalar1',StandardScaler()),
('pca1',PCA(n_components=2)),
('lr_classifier',LogisticRegression(random_state=0))])
pipeline_dt=Pipeline([('scalar2',StandardScaler()),
('pca2',PCA(n_components=2)),
('dt_classifier',DecisionTreeClassifier())])
pipeline_randomforest=Pipeline([('scalar3',StandardScaler()),
('pca3',PCA(n_components=2)),
('rf_classifier',RandomForestClassifier())])
pipeline_knn=Pipeline([('scalar4',StandardScaler()),
('pca4',PCA(n_components=2)),
('knn_classifier',KNeighborsClassifier(n_neighbors = 3))])
pipeline_svm=Pipeline([('scalar5',StandardScaler()),
('pca5',PCA(n_components=2)),
('svm_classifier',svm.SVC())])
## Create the list of pipelines
pipelines = [pipeline_lr, pipeline_dt, pipeline_randomforest,pipeline_knn,pipeline_svm]
best_accuracy=0.0
best_classifier=0
best_pipeline=""
bestmodel=""
# Dictionary of pipelines and classifier types for ease of reference
pipe_dict = {0: 'Logistic Regression', 1: 'Decision Tree', 2: 'RandomForest',3: 'KNN',4: 'SVM'}
# Fit the pipelines
for pipe in pipelines:
pipe.fit(X_train, y_train)
for i,model in enumerate(pipelines):
print("{} Test Accuracy: {}".format(pipe_dict[i],model.score(X_test,y_test)))
for i,model in enumerate(pipelines):
if model.score(X_test,y_test)>best_accuracy:
best_accuracy=model.score(X_test,y_test)
best_pipeline=model
best_classifier=i
print('Classifier with best accuracy:{}'.format(pipe_dict[best_classifier]))
best_model = pipe_dict[best_classifier]
return best_model
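Picking the winner from a single test split can be noisy; a sketch (toy data, a reduced candidate list) of the same scaler-PCA-classifier comparison scored by cross-validation instead:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for the churn frame (hypothetical shape)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    'Logistic Regression': LogisticRegression(random_state=0),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
}

# Score each scaler -> PCA -> classifier pipeline with 5-fold cross-validation
scores = {}
for name, clf in candidates.items():
    pipe = Pipeline([('scale', StandardScaler()),
                     ('pca', PCA(n_components=2)),
                     ('clf', clf)])
    scores[name] = cross_val_score(pipe, X_toy, y_toy, cv=5).mean()

best = max(scores, key=scores.get)
print(best)
```

Averaging over folds makes the "best model" decision less dependent on one particular 75/25 split.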
data=buildDataframeFromCSV('TelcomCustomer-Churn_2.csv')
preprocessor=Preprocessor()
preprocessor.analyseData(data)
#preprocessor.visualizeData()
preprocessor.transformData(preprocessor.columnsToDrop,preprocessor.objectColumnsHaveNumeric )
preprocessor.visualizeData()
preprocessor.encodeData()
X,y=preprocessor.splitXY()
X_train, X_test, y_train, y_test=preprocessor.splitTrainAndTest()
bestModel=getBestModel(data)
Percentage of Null Values
--------------------------
Feature Name percent_missing
customerID customerID 0.0
OnlineBackup OnlineBackup 0.0
DeviceProtection DeviceProtection 0.0
TechSupport TechSupport 0.0
StreamingTV StreamingTV 0.0
StreamingMovies StreamingMovies 0.0
Contract Contract 0.0
PaperlessBilling PaperlessBilling 0.0
PaymentMethod PaymentMethod 0.0
MonthlyCharges MonthlyCharges 0.0
TotalCharges TotalCharges 0.0
Churn Churn 0.0
Details of Data features
---------------------------
The number of rows in the data 7043
The number of features in the data 12
Feature Names
----------------
Index(['customerID', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
dtype='object')
Data types of all the features
-------------------------------
customerID object
OnlineBackup object
DeviceProtection object
TechSupport object
StreamingTV object
StreamingMovies object
Contract object
PaperlessBilling object
PaymentMethod object
MonthlyCharges float64
TotalCharges object
Churn object
dtype: object
5 point summary of the data
------------------------------
count mean std min 25% 50% 75% \
MonthlyCharges 7043.0 64.761692 30.090047 18.25 35.5 70.35 89.85
max
MonthlyCharges 118.75
Categorical Columns
---------------------
Index(['customerID', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
'PaymentMethod', 'TotalCharges', 'Churn'],
dtype='object')
Numerical Columns
---------------------
Index(['MonthlyCharges'], dtype='object')
The features that are not helpful for model building
-----------------------------------------------------
features having values 100 customerID
features more than 90 TotalCharges
columnsToDrop customerID
objectColumnsHaveNumeric TotalCharges
converted objects into category
cols to drop= customerID
datatype before conversion object
datatype after conversion float64
Null value found
Churn
Pie chart for all categorical Columns
Logistic Regression Test Accuracy: 0.7571103526734926
Decision Tree Test Accuracy: 0.6996587030716723
RandomForest Test Accuracy: 0.7360637087599545
KNN Test Accuracy: 0.7309442548350398
SVM Test Accuracy: 0.7514220705346986
Classifier with best accuracy:Logistic Regression
print(bestModel)
Logistic Regression
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
# Fit the model on train
model = LogisticRegression()
model.fit(X_train, y_train)
LogisticRegression()
#Predict the response for test dataset
y_pred = model.predict(X_test)
from sklearn import metrics
# Model Accuracy: how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
Accuracy: 0.7764505119453925
model.coef_
array([[-2.26491844e-01, -6.95919607e-02, -2.46689113e-01,
1.03047326e-01, 8.04626574e-02, -1.17091213e+00,
3.54278471e-01, 3.57597966e-04, 2.20950860e-02,
-2.62686829e-04]])
model.intercept_
array([-1.49071874])
import pickle
with open('model_pickle','wb') as file:
pickle.dump(model,file)
with open('model_pickle','rb') as file:
mp = pickle.load(file)
mp.coef_
array([[-2.26491844e-01, -6.95919607e-02, -2.46689113e-01,
1.03047326e-01, 8.04626574e-02, -1.17091213e+00,
3.54278471e-01, 3.57597966e-04, 2.20950860e-02,
-2.62686829e-04]])
mp.intercept_
array([-1.49071874])
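The coefficient check above can be made into an explicit roundtrip test; a self-contained sketch on toy data (serializing in memory to avoid touching the filesystem):

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the churn features (hypothetical shape)
X_toy, y_toy = make_classification(n_samples=100, n_features=5, random_state=1)
model = LogisticRegression().fit(X_toy, y_toy)

# A deserialized model must give identical predictions to the in-memory one
reloaded = pickle.loads(pickle.dumps(model))
same = (reloaded.predict(X_toy) == model.predict(X_toy)).all()
print(same)
```

The same check applies to a model loaded from `model_pickle`: matching predictions confirm the persisted model can replace retraining.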
#Predict the response for test dataset
y_pred = model.predict(X_test)
from sklearn import metrics
# Model Accuracy: how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
Accuracy: 0.7764505119453925
Observation And Conclusion:
Through the pipeline created here we can pass in a DataFrame and identify the best performing model based on accuracy. The sklearn Pipeline library is the standard industry approach for chaining preprocessing steps and model fitting on preprocessed data.
The best model is identified and saved to a pickle file, which can be used for further prediction. We are able to save the model and retrieve it, and the reloaded model gives the same accuracy when used for prediction. In this way the time spent on retraining can be drastically reduced.